1 Overview


Seven machine learning models and one dummy classifier were trained. The dummy always predicts the most frequent class, which provides a meaningful baseline, for example by plotting the difference between a model and the dummy, although it is removed from most graphs for now.

The seven models, plus the dummy, and their letter codes are shown below.

DUM - Dummy
GNB - GaussianNaiveBayes
GB - GradientBoosting
KNN - KNearestNeighbours
LSCM - LinearSVC
LG - LogisticRegression
RF - RandomForest
SVM - SupportVectorMachine

The data consisted of all the hypertensive subtypes (excluding the healthy controls), and not all the microRNA variables were used. They were first filtered via Recursive Feature Elimination (RFE), with a Gradient Boosted Tree as the base model, leaving 131 features. The RFE could be improved by using an ensemble of various models, as relying on a single Gradient Boosted Tree risks over-fitting: when Random Forest was used for RFE instead, only 87 features were selected. This is not an insignificant difference and should be explored in more detail at a later date.
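The RFE step above could be sketched as follows. This is a minimal illustration, not the actual pipeline: the synthetic data stands in for the microRNA expression matrix, and the feature counts are placeholders.

```python
# Hedged sketch of RFE with a Gradient Boosted Tree as the base model.
# The synthetic data and feature counts are illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFE

# Stand-in for the microRNA expression matrix (samples x miRNAs).
X, y = make_classification(n_samples=100, n_features=50, n_informative=10,
                           random_state=0)

# RFE repeatedly fits the base model and drops the least important
# features (here 5 per iteration) until the target count remains.
rfe = RFE(estimator=GradientBoostingClassifier(random_state=0),
          n_features_to_select=20, step=5)
rfe.fit(X, y)

selected = np.where(rfe.support_)[0]
print(len(selected))  # 20 features retained
```

Swapping the `estimator` for a `RandomForestClassifier` reproduces the comparison mentioned above, since both expose the `feature_importances_` attribute RFE relies on.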

Six of the seven model types (all except Gaussian Naive Bayes) were trained across a wide range of hyperparameters to obtain a collection of models of varying quality. Each model was evaluated with a battery of metrics, which can be compared along two main axes. First is the metric type, such as precision, recall, F1-score or accuracy. Second is how the metric generalises to the multiclass problem, through Micro, Macro or Weighted averaging of the values from the confusion matrix, i.e. the averaging method. Some metrics natively handle the multiclass setting, like accuracy, the Matthews Correlation Coefficient (MCC) and Cohen's Kappa; here these are described as having an averaging method of “None”, although “NA” might be clearer. More precisely, most metrics are formed as a combination of an averaging method and a metric type, such as Macro Precision or Micro F1-score.

Micro-averaging sums the False Positives (FP), True Positives (TP), etc., across all classes, then performs the calculation. For recall, all the True Positives are summed and divided by the sum of True Positives and False Negatives (FN), i.e. TP / (TP + FN). In comparison, Macro-averaging calculates the metric for each class and then takes the mean. Concretely, Macro-Recall finds the recall for each class first, then takes the mean of those per-class values.
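The distinction can be seen on a toy three-class problem (the labels below are hypothetical, not drawn from the hypertension data):

```python
# Micro- vs macro-averaged recall on a small made-up multiclass example.
import numpy as np
from sklearn.metrics import recall_score

y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 0, 2, 2, 1, 1])

# Micro: pool TP and FN over all classes, then compute TP / (TP + FN).
micro = recall_score(y_true, y_pred, average="micro")

# Macro: per-class recall first, then the unweighted mean.
per_class = recall_score(y_true, y_pred, average=None)
macro = per_class.mean()

print(micro)      # 0.6 -- pooled over all classes
print(per_class)  # [0.75, 0.5, 0.5] -- recall per class
print(macro)      # ~0.583 -- mean of the per-class recalls
```

Note that micro-averaged recall in the multiclass setting equals overall accuracy, which is one reason the averaging method matters so much in the comparisons below.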

1.1 Distribution of Metric Values

2 Comparison of Metrics relative to the Mean


To compare the metrics to each other across the various models, the mean score of each model was found, and each obtained metric was divided by this mean. Consequently, a relative value of 1 indicates the metric is identical to the mean, while values less than one indicate it is typically smaller than the mean (likely suggesting it is more conservative). Values greater than one can be interpreted as the opposite, indicating a generally more optimistic metric.
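A minimal sketch of this normalisation, assuming the scores live in a long-format table (the column names and values are illustrative, not the report's actual ones):

```python
# Divide each metric value by the mean score of its model, so 1.0 means
# "equal to that model's mean". Data here is made up for illustration.
import pandas as pd

scores = pd.DataFrame({
    "model":  ["GB", "GB", "RF", "RF"],
    "metric": ["macro_f1", "accuracy", "macro_f1", "accuracy"],
    "value":  [0.70, 0.80, 0.60, 0.75],
})

# Per-model mean broadcast back onto each row, then used as the divisor.
scores["relative"] = scores["value"] / scores.groupby("model")["value"].transform("mean")
print(scores)
```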

On the X-axis we have the model types and on the Y-axis the relative mean metric score. The values are coloured by metric type in the first graph and by averaging method in the second. The latter shows much clearer grouping, indicating the averaging method is very influential and that the different methods carry different levels of optimism/bias.

It is also worth pointing out that the most conservative metrics appear to be MCC and Cohen's Kappa, which naturally generalise to the multiclass setting.

2.1 Comparison of Metric Type

2.2 Comparison of Averaging Method

3 Comparison of Mean and Standard Deviation of Each Metric


3.1 Colour by Metric Type

With the relative mean and relative standard deviation on the X and Y axes, respectively, the metric scores appear to cluster by metric rather than by the model they came from.

3.2 Colour by Averaging Method

It is quite nice to see the clear clusters across the different averaging methods. This likely indicates the averaging method is an important choice, given how differently each method behaves.

3.3 Colour by Model Type

This was to illustrate that the metric scores cluster with values of the same metric rather than with values from the same model. If they clustered by model instead, comparisons between metrics would likely not be possible.

4 Comparison of Models using Dummy Classifier


The Dummy classifier is configured to always predict the majority class, meaning it does not use any information from the features. If the dummy classifier achieves large values in a metric, then that metric is likely over-emphasising the performance of the model due to the most frequent label; if the majority class were even larger, we would expect that metric to grow larger too. A metric on which the dummy classifier performs poorly is likely useful, as it shows that a genuine model has made a considerable improvement that is not due to the imbalance of the data. Metrics like Cohen's Kappa and the Matthews Correlation Coefficient both give the dummy a score of zero, indicating they could be good candidates for assessing models on imbalanced data.
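This behaviour can be demonstrated on imbalanced toy data (the 90/10 split below is hypothetical, chosen only to make the effect visible):

```python
# A majority-class dummy gets high accuracy on imbalanced labels,
# but zero MCC and zero Cohen's kappa, since it uses no feature information.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, cohen_kappa_score, matthews_corrcoef

# 90/10 imbalanced labels; the single feature carries no information here.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))

dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = dummy.predict(X)

print(accuracy_score(y, y_pred))     # 0.9 -- inflated by the imbalance
print(matthews_corrcoef(y, y_pred))  # 0.0 -- no real predictive skill
print(cohen_kappa_score(y, y_pred))  # 0.0 -- no agreement beyond chance
```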

4.1 Metric Scores of Dummy

4.2 Absolute Difference

The metric score for a model minus the score from the dummy.

4.2.1 By Averaging Method

4.2.2 By Metric

4.3 Relative Difference

The metric score of the model divided by the score from the dummy. A score of 1 is achieved when the model and dummy have the same value. Larger scores indicate larger positive changes in the metric score.

4.3.1 By Averaging Method

4.3.2 By Metric

4.4 Relative to Maximum Difference

The difference between the model and the dummy, divided by one minus the dummy score. For example, if the dummy has a score of 60%, that means the maximum increase that is possible is 40%. So we take the difference and divide by this maximum to see how much the model improves. A score of 1 denotes the model achieved 100% of the possible increase.
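The three dummy-based comparisons defined in sections 4.2 to 4.4 can be sketched together (the scores below are hypothetical):

```python
# Absolute, relative, and relative-to-maximum differences between a model's
# score and the dummy's score, using made-up values.
model_score = 0.85
dummy_score = 0.60

# 4.2: model minus dummy.
absolute = model_score - dummy_score

# 4.3: model divided by dummy; 1.0 means no improvement.
relative = model_score / dummy_score

# 4.4: improvement as a fraction of the maximum possible improvement,
# e.g. dummy at 60% leaves a maximum possible gain of 40%.
rel_to_max = (model_score - dummy_score) / (1 - dummy_score)

print(absolute, relative, rel_to_max)  # ~0.25, ~1.417, ~0.625
```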

4.4.1 By Averaging Method

4.4.2 By Metric

5 Comparison of Models Using Metrics*


*Note: these graphs are not in their final form

The five folds used in the cross validation are the same across all the models, making comparisons more appropriate than if the folds were completely random. It is clear the best type of model is the Gradient Boosted Tree, whilst Gaussian Naive Bayes appears the weakest. This could be partially explained by Gradient Boosting having the largest number of individual models, due to having several hyperparameters. The hyperparameter tuning of these models has yet to be finished; currently a wide range of hyperparameters has been explored, but a narrower, more focused scope could improve model performance.
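Keeping the five folds identical across model types could be achieved by building one seeded splitter and reusing it for every estimator. This is a sketch under that assumption, with synthetic data and only two of the model types shown:

```python
# One fixed StratifiedKFold split reused for every model, so each model
# is scored on exactly the same five folds. Data and seed are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=100, n_features=10, random_state=0)

# A fixed random_state means every call to split() yields the same folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for model in [DummyClassifier(strategy="most_frequent"),
              LogisticRegression(max_iter=1000)]:
    fold_scores = cross_val_score(model, X, y, cv=cv)
    print(type(model).__name__, fold_scores.mean())
```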

5.1 Micro Averaging

5.1.1 Mean Score

5.1.2 Median Score

5.1.3 Max Score

5.2 Macro Averaging

5.2.1 Mean Score

5.2.2 Median Score

5.2.3 Max Score

5.3 Weighted Averaging

5.3.1 Mean Score

5.3.2 Median Score

5.3.3 Max Score

5.4 No Averaging

5.4.1 Mean Score

5.4.2 Median Score

5.4.3 Max Score

6 Comparison of Top Model Using Metrics


During the hyperparameter tuning stage, multiple models were created. However, not all model types produced the same number of models, as some have fewer hyperparameters to tune and therefore fewer combinations to explore. This means model types like Random Forest and Gradient Boosted Trees, which have several hyperparameters, are more likely to contain a better model simply because more models were trained. Additionally, a practitioner is not interested in the mean performance of a group of models, but in the maximum score achieved by an individual model, in order to select the appropriate model or hyperparameters. Selecting the best 10 models therefore allows a fairer and more realistic comparison between model types, as poorly configured models can be ignored and the performance of a model type is not underestimated. Since we are using several metrics, we select the top 10 scores for every metric, then obtain the other metrics for each of those models - to see how the best accuracy compares to the best Macro-Precision, for example.
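The selection step could be sketched as below, assuming a long-format table of scores; the column names and random values are illustrative, not the report's actual data.

```python
# For one metric, take the 10 best-scoring models, then pull back every
# metric recorded for those models. All names and values are made up.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
scores = pd.DataFrame({
    "model_id": np.repeat(np.arange(30), 2),
    "metric":   ["accuracy", "macro_f1"] * 30,
    "value":    rng.uniform(0.5, 0.95, size=60),
})

# Best 10 model_ids by accuracy...
top_ids = (scores[scores["metric"] == "accuracy"]
           .nlargest(10, "value")["model_id"])

# ...then every metric for those models, so the best-accuracy models can be
# compared on macro F1 as well.
top_models = scores[scores["model_id"].isin(top_ids)]
print(len(top_models))  # 10 models x 2 metrics = 20 rows
```

In practice this would be repeated once per metric, giving one "top 10" subset for each.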

6.1 Overall Metric Behaviour

Since the relative metric scores are based on the pool of metric values achieved by all the models, and we have removed the majority of the models, recalculating the relative scores helps confirm that the behaviour highlighted previously is consistent in this subset.

6.1.1 Relative Metric Scores (by Average Type)

6.1.2 Relative Metric Scores (by Metric Type)

6.1.3 Mean Metric Scores (by Metric Type)

6.2 Spread of Metric Values

6.2.1 Simply Showing Points

6.2.2 Linerange Plot (By Model)

I wanted to display the above graphs in a different way. Another important thing to consider is the variation in the metric values between the different folds. The metrics that do not require an averaging method appear less variable, while the three other types of averaging look more variable. Plotting the mean value across the folds, together with the maximum and minimum, allows us to compare the variability.

6.2.3 Linerange Plot (By Average Type)

6.2.4 Linerange Plot (By Metric Type)

6.3 Performance Across the Folds

6.3.1 Micro Averaging

6.3.2 Macro Averaging

6.3.3 Weighted Averaging

6.3.4 No Averaging

6.4 Heatmap of Maximum Values

7 Correlation of Metrics


7.1 Pearson’s r

7.2 Spearman’s r

7.3 Pearson’s r (top models)

This uses the subset of the best models.

7.4 Spearman's r (top models)

This uses the subset of the best models.

8 Confusion Matrix


8.1 All Models

TP - True Positive
FP - False Positive
TN - True Negative
FN - False Negative
PRECISION - Precision
RECALL - Recall
F1_SCORE - F1-Score
NPV - Negative Predictive Value
FDR - False Discovery Rate
FOMR - False Omission Rate

8.1.1 Confusion Matrix Values for Each Model

8.1.2 Precision and Recall

8.1.3 F1-Score

8.2 The Top Models

8.2.1 Confusion Matrix Values for Each Model

8.2.2 Precision and Recall

8.2.3 F1-Score

8.3 Testing Heatmaps for the Values from the Confusion Matrix

Not sure this is useful!!